From Eager Operators to Block-Based Parallelism
AI023 Lesson 3

Transitioning from PyTorch Eager Mode to Triton requires a shift from viewing tensors as monolithic objects to viewing them as collections of discrete, manageable blocks or tiles.

1. PyTorch vs. Triton Tensors

It is vital to distinguish Triton tensors from PyTorch tensors. A PyTorch tensor is a host-side Python object that wraps shape, dtype, device, stride, and storage metadata around a block of memory. In contrast, a Triton kernel receives raw device pointers and explicitly loads and stores blocks of data from them, which enables much lower-level control over memory access.
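The metadata-wrapper idea can be made concrete. The sketch below uses NumPy (whose arrays have the same metadata/storage split; PyTorch exposes the equivalents via `t.shape`, `t.stride()`, and `t.data_ptr()`) because it runs without a GPU:

```python
import numpy as np

# A NumPy array, like a PyTorch tensor, is a host-side wrapper:
# shape/strides/dtype metadata plus a pointer to flat storage.
t = np.arange(12, dtype=np.float32).reshape(3, 4)

print(t.shape)        # metadata: (3, 4)
print(t.strides)      # metadata: byte strides, (16, 4) for row-major float32
print(t.ctypes.data)  # the raw address a low-level kernel would receive

# A transposed view shares the SAME storage pointer; only metadata changes.
v = t.T
assert v.ctypes.data == t.ctypes.data
```

A Triton kernel never sees the wrapper: the host passes it the raw pointer, and the kernel itself must compute which offsets to read.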

2. The Eager Bottleneck

In standard eager execution, every operation (e.g., an addition followed by a ReLU) requires its own kernel launch and a round trip through global memory: the intermediate result is written out and then read back in. For memory-bound elementwise workloads, these round trips dominate runtime. Triton overcomes this by fusing the operations into a single kernel that processes blocks of data (e.g., 128, 256, or 512 elements) entirely in on-chip memory, writing only the final result back.
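The fusion idea can be sketched in pure NumPy (no GPU required): the eager version materializes a full-size intermediate between two passes, while the fused version applies both operations to each block before storing it. The function names and BLOCK size here are illustrative, not Triton APIs:

```python
import numpy as np

BLOCK = 128  # block size; 128, 256, or 512 are typical choices

def eager_add_relu(x, y):
    # Eager style: two separate "kernels", with a full-size intermediate
    # written to (global) memory between them.
    tmp = x + y                # kernel 1: writes tmp
    return np.maximum(tmp, 0)  # kernel 2: re-reads tmp

def fused_add_relu(x, y):
    # Fused block style: each block is loaded once, both ops are applied
    # while it stays "on chip", and only the final result is stored.
    out = np.empty_like(x)
    n = x.shape[0]
    for pid in range(-(-n // BLOCK)):  # ceil-div: one program per block
        lo, hi = pid * BLOCK, min((pid + 1) * BLOCK, n)
        blk = x[lo:hi] + y[lo:hi]        # add, in registers/SRAM
        out[lo:hi] = np.maximum(blk, 0)  # relu, then a single store
    return out

x = np.random.randn(1000).astype(np.float32)
y = np.random.randn(1000).astype(np.float32)
assert np.allclose(fused_add_relu(x, y), eager_add_relu(x, y))
```

In a real Triton kernel the `for pid` loop disappears: the GPU launches one program instance per block, and each instance runs the loop body in parallel.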

3. The Block-Based Paradigm

Instead of the per-thread, scalar programming model of CUDA, Triton uses SPMD (Single Program, Multiple Data) at the block level. You write one kernel, and Triton launches many instances of it across a grid; each instance reads its program_id to compute which "chunk" of memory it owns.
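The index arithmetic each instance performs can be sketched in plain Python. The helper below mirrors the canonical Triton pattern (`offs = pid * BLOCK + tl.arange(0, BLOCK)`, `mask = offs < n`), with `block_offsets` being an illustrative name, not a Triton API:

```python
import numpy as np

def block_offsets(pid, block_size, n):
    # Each program instance owns one contiguous chunk of block_size
    # elements, identified purely by its pid.
    offs = pid * block_size + np.arange(block_size)
    mask = offs < n  # guards the ragged final block
    return offs, mask

n, BLOCK = 300, 128
grid = -(-n // BLOCK)  # ceil-div, like triton.cdiv(n, BLOCK)
assert grid == 3       # pids 0, 1, 2

offs, mask = block_offsets(2, BLOCK, n)
assert offs[0] == 256            # last block starts at element 256
assert mask.sum() == 300 - 256   # only 44 of its 128 lanes are valid
```

The mask is what makes a fixed block size safe for sizes that do not divide evenly: out-of-range lanes are simply disabled on load and store.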

Diagram: a PyTorch tensor (metadata wrapper) partitioned into Block 0 (pid 0), Block 1 (pid 1), and Block 2 (pid 2).

4. Environment Setup

To begin, install Triton in a clean environment (using Conda or venv) to ensure no dependency conflicts with existing CUDA toolkits: pip install triton.
